R Markdown

To conduct PCA and MDS, the raw data must first be preprocessed to remove features that are mostly missing (NA). First, Asian handicap odds are discarded, because they are very specific and have too many missing values. Then, for consistency, only the latest odd for each match, odd type and bookmaker is kept for further calculations. Many missing values still remain after these steps. To find the most complete part of the data, the whole table is converted to a logical matrix that marks whether each entry is missing, and the missing counts per bookmaker are plotted as a heatmap. The heatmap shows that the “bts” bet type contains many missing values. The “bts” bet type is therefore discarded and the heatmap is plotted again.
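The missing-value screening described above can be sketched on a toy table. This is only an illustration in base R: the bookmaker names, column names and odds are invented, and the report itself works on the full odds data with data.table.

```r
# Toy wide table: one row per (match, bookmaker), one column per odd type.
# All names and values here are made up for illustration.
odds_wide <- data.frame(
  bookmaker = c("A", "A", "B", "B", "C", "C"),
  odd1      = c(1.8, 2.1, NA, 2.0, 1.9, NA),
  oddX      = c(3.4, NA, NA, 3.3, 3.5, 3.2),
  YES       = c(NA, NA, NA, NA, 1.7, NA)
)

# Logical matrix: TRUE where an odd is missing.
is_missing <- is.na(odds_wide[, -1])

# Count missing entries per bookmaker and odd type.
na_counts <- rowsum(is_missing * 1, group = odds_wide$bookmaker)
print(na_counts)

# Odd types (columns) and bookmakers (rows) with many missing values stand out.
heatmap(na_counts, Rowv = NA, Colv = NA, scale = "none")
```

Columns that are missing for almost every bookmaker, like `YES` in this toy table, are the natural candidates to drop before the next heatmap.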

With the new heatmap, the five bookmakers with the fewest missing odds are identified: “youwin”, “888sport”, “Betfair Exchange”, “Sportingbet” and “Betsafe”. The odds of these bookmakers are then averaged per feature for each matchId to prepare the input for PCA and MDS.
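Averaging the odds of several bookmakers per match might look as follows in base R. The match IDs, bookmaker names and odds are invented for illustration, and `aggregate()` is used here instead of the data.table grouping in the appendix.

```r
# Toy table: odds from several bookmakers for two matches.
# Match IDs, bookmaker names and odds are invented for illustration.
odds <- data.frame(
  matchId   = c("m1", "m1", "m2", "m2", "m2"),
  bookmaker = c("A", "B", "A", "B", "C"),
  odd1      = c(2.0, 2.2, 1.5, 1.7, NA),
  odd2      = c(3.0, 3.4, 5.0, 5.4, 5.2)
)

# Keep only rows with all odds present, then average each feature per match.
complete <- na.omit(odds)
per_match <- aggregate(cbind(odd1, odd2) ~ matchId, data = complete, FUN = mean)
print(per_match)
```

Each match ends up as a single row of averaged odds, which is the shape that `prcomp()` and `dist()` expect.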

## List of 5
##  $ sdev    : num [1:10] 11.032 5.017 2.087 1.599 0.363 ...
##  $ rotation: num [1:10, 1:10] 0.005923 0.000182 0.002015 0.029889 0.009402 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "1" "12" "1X" "2" ...
##   .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:10] 2.02 1.25 1.46 3.47 1.96 ...
##   ..- attr(*, "names")= chr [1:10] "1" "12" "1X" "2" ...
##  $ scale   : logi FALSE
##  $ x       : num [1:2886, 1:10] -7.4 -7.15 -8.7 -8.48 -7.84 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2886] "02oVDuv1" "04PCiQzK" "04zko0D5" "061xSktd" ...
##   .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
## Importance of components:
##                            PC1    PC2     PC3    PC4     PC5     PC6
## Standard deviation     11.0321 5.0167 2.08699 1.5990 0.36265 0.25971
## Proportion of Variance  0.7903 0.1634 0.02828 0.0166 0.00085 0.00044
## Cumulative Proportion   0.7903 0.9537 0.98198 0.9986 0.99944 0.99988
##                            PC7     PC8     PC9    PC10
## Standard deviation     0.11031 0.07143 0.03348 0.02927
## Proportion of Variance 0.00008 0.00003 0.00001 0.00001
## Cumulative Proportion  0.99995 0.99999 0.99999 1.00000

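After PCA, the cumulative share of the PCs is plotted. The plot is based on standard deviations: the first 2 PCs cover approximately 80% of the summed standard deviation of the data, and the first 3 PCs approximately 90%. In terms of variance, the summary above gives 95.4% for the first 2 PCs and 98.2% for the first 3.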
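The share of summed standard deviations and the share of variance can both be checked from the standard deviations printed in the summary. The sketch below is only an illustration; the variable names are not taken from the original analysis.

```r
# Standard deviations of the ten PCs, copied from the summary output above.
sdev <- c(11.0321, 5.0167, 2.08699, 1.5990, 0.36265,
          0.25971, 0.11031, 0.07143, 0.03348, 0.02927)

# Share of summed standard deviations (what cumsum(sdev) / sum(sdev) gives).
sd_share <- cumsum(sdev) / sum(sdev)

# Share of variance (what summary() reports as "Cumulative Proportion").
var_share <- cumsum(sdev^2) / sum(sdev^2)

# First three components, both measures side by side.
print(round(rbind(sd_share, var_share)[, 1:3], 3))
```

Because variance is the squared standard deviation, the leading components take a larger share under the variance measure than under the standard deviation measure.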

The first 2 PCs are merged with the match results, and over and under results are colored differently. It is hard to deduce anything from the 2D PC plot.

For a closer look, the first 3 PCs are plotted; unfortunately, it is still hard to deduce a meaningful result.

Euclidean distance is used for MDS. The results are very similar to the 2D PCA, so it is again hard to deduce a meaningful result.

Manhattan distance is used for MDS. It is again hard to deduce a meaningful result.
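The similarity to the 2D PCA is expected: classical MDS on Euclidean distances reproduces the PCA scores of the centered data up to the sign of each axis. A minimal check on random toy data (not the odds data) could look like this:

```r
set.seed(1)
# Random toy data: 20 observations, 4 features (not the odds data).
X <- matrix(rnorm(80), nrow = 20, ncol = 4)

# PCA scores on centered, unscaled data (first two components).
pca_scores <- prcomp(X)$x[, 1:2]

# Classical MDS on Euclidean distances between the same rows.
mds_points <- cmdscale(dist(X, method = "euclidean"), k = 2)

# Largest absolute difference once the sign of each axis is ignored.
max_diff <- max(abs(abs(unname(pca_scores)) - abs(unname(mds_points))))
print(max_diff)
```

With another distance, such as Manhattan, this equivalence no longer holds, so the MDS picture can differ from the PCA picture.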

The relevant match results are merged with the split data and PCA is conducted. The first two PCs are then plotted; unfortunately, it is hard to deduce meaningful results.
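The result labels that are merged onto the PC scores can be derived from the score strings. The sketch below uses an invented match table and base R; the column names `Result` and `IsOver` follow the appendix, while the three-level `Result` coding is an assumption made here for compactness.

```r
# Toy match table; IDs and scores are invented for illustration.
matches <- data.frame(
  matchId = c("m1", "m2", "m3", "m4"),
  score   = c("2:1", "0:0", "1:3", "2:2"),
  stringsAsFactors = FALSE
)

# Split "home:away" into two numeric columns.
goals <- do.call(rbind, strsplit(matches$score, ":", fixed = TRUE))
matches$HomeGoals <- as.numeric(goals[, 1])
matches$AwayGoals <- as.numeric(goals[, 2])

# Three-way result and over/under label (over means more than 2 goals).
matches$Result <- ifelse(matches$HomeGoals > matches$AwayGoals, "home",
                  ifelse(matches$HomeGoals < matches$AwayGoals, "away", "tie"))
matches$IsOver <- as.integer(matches$HomeGoals + matches$AwayGoals > 2)
print(matches)
```

A table like this can then be joined to the PC scores by `matchId` with `merge()`.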

Then the first three PCs are plotted. Home win results are clearly separated from the bulk. It is worth playing a home win bet accordingly.

The first three PCs are plotted. Away win results are clearly separated from the bulk. It is worth playing an away win bet accordingly.

The first three PCs are plotted. Tie results are not separated from the bulk as clearly as home win results. However, it is still worth playing a tie bet accordingly.

## [1] 512 512   3

The original photo is plotted in the northwest position, followed clockwise by its green, red and blue channels.

The photo with random noise added is plotted in the northwest position, followed clockwise by its green, red and blue channels.
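Adding noise and keeping the intensities valid can be sketched on a small stand-in array. The 4 x 4 size and the noise range [0, 0.1] are assumptions for this illustration; `readJPEG()` returns the same kind of height x width x channel array with values in [0, 1].

```r
set.seed(42)
# Stand-in for a photo: a small 4 x 4 x 3 array with intensities in [0, 1].
photo <- array(runif(4 * 4 * 3), dim = c(4, 4, 3))

# Uniform noise in [0, 0.1], drawn independently for every pixel and channel.
noise <- array(runif(length(photo), min = 0, max = 0.1), dim = dim(photo))

# Add the noise and clip back into the valid intensity range.
photo_noise <- pmin(pmax(photo + noise, 0), 1)
print(range(photo_noise))
```

Clipping matters because `rasterImage()` only accepts intensities between 0 and 1.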

The photo is converted to greyscale and plotted. In the appendix code this is done by averaging the three channels of the noise-added photo.
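The PCA that follows is run on 3 x 3 patches of the greyscale image, one row per patch (510 x 510 = 260100 rows for the 512 x 512 photo). A small sketch of the greyscale conversion and patch extraction, using an invented 6 x 6 stand-in image, is given below.

```r
set.seed(7)
# Stand-in for a color photo: 6 x 6 pixels, 3 channels, values in [0, 1].
photo <- array(runif(6 * 6 * 3), dim = c(6, 6, 3))

# Greyscale as the unweighted mean of the three channels.
grey <- (photo[, , 1] + photo[, , 2] + photo[, , 3]) / 3

# One row per 3 x 3 patch, scanning patch positions row by row.
# A square image is assumed, so the same count is used in both directions.
n_pos <- nrow(grey) - 2
patches <- matrix(0, nrow = n_pos * n_pos, ncol = 9)
n <- 1
for (p in seq_len(n_pos)) {
  for (r in seq_len(n_pos)) {
    # t() so the 9 values are stored row by row within the patch.
    patches[n, ] <- as.vector(t(grey[p:(p + 2), r:(r + 2)]))
    n <- n + 1
  }
}
print(dim(patches))
```

With this layout, column 5 of `patches` always holds the center pixel of the patch.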

## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     0.5670862 0.08920314 0.07709464 0.03947536
## Proportion of Variance 0.9460196 0.02340789 0.01748439 0.00458410
## Cumulative Proportion  0.9460196 0.96942754 0.98691192 0.99149602
##                             Comp.5      Comp.6      Comp.7       Comp.8
## Standard deviation     0.034739622 0.027315362 0.018617140 0.0180273675
## Proportion of Variance 0.003550195 0.002194906 0.001019596 0.0009560192
## Cumulative Proportion  0.995046217 0.997241123 0.998260718 0.9992167377
##                              Comp.9
## Standard deviation     0.0163174619
## Proportion of Variance 0.0007832623
## Cumulative Proportion  1.0000000000
## List of 7
##  $ sdev    : Named num [1:9] 0.5671 0.0892 0.0771 0.0395 0.0347 ...
##   ..- attr(*, "names")= chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ loadings: 'loadings' num [1:9, 1:9] 0.331 0.335 0.331 0.336 0.339 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ center  : num [1:9] 0.485 0.486 0.487 0.486 0.486 ...
##  $ scale   : num [1:9] 1 1 1 1 1 1 1 1 1
##  $ n.obs   : int 260100
##  $ scores  : num [1:260100, 1:9] -0.764 -0.783 -0.783 -0.794 -0.771 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:9] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ call    : language princomp(x = patches)
##  - attr(*, "class")= chr "princomp"

This image is produced using only the scores of the first PC. It is close to the original photo.

This image is produced using only the scores of the second PC. It emphasizes lower edges.

This image is produced using only the scores of the third PC. It emphasizes upper edges.
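The appendix code clips the scores to [0, 1] before plotting. One possible alternative, shown here only as a sketch on random toy patches and not as the method used in this report, is to rescale the scores of a component to [0, 1] so that the full range stays visible.

```r
set.seed(3)
# Toy stand-in: 25 patches (a 5 x 5 grid of positions) with 9 pixels each.
patches <- matrix(runif(25 * 9), nrow = 25, ncol = 9)
pca <- princomp(patches)

# Scores of the first component, one value per patch position.
s <- pca$scores[, 1]

# Min-max rescaling keeps the full range instead of clipping to [0, 1].
s01 <- (s - min(s)) / (max(s) - min(s))

# Patches were stored row by row, so the image is filled by row.
img <- matrix(s01, nrow = 5, ncol = 5, byrow = TRUE)
print(range(img))
```

The resulting matrix can be passed to `rasterImage()` in the same way as the clipped scores.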

PC1 is focused mostly on the center pixel.

PC2 is focused mostly on the south-western pixel.

PC3 is focused mostly on the north-western pixel.
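Which pixel a component emphasizes can be read from its loading vector once it is reshaped into the 3 x 3 patch layout. The sketch below uses random toy patches, so its output says nothing about the photo. It assumes that patches are flattened row by row, hence `byrow = TRUE`. Note that `image()` draws a matrix rotated by 90 degrees counter-clockwise relative to its printed layout, which affects how compass directions are read from the plot.

```r
set.seed(5)
# Toy patch matrix: 200 random 3 x 3 patches flattened row by row.
patches <- matrix(runif(200 * 9), nrow = 200, ncol = 9)
pca <- princomp(patches)

# Reshape the first loading vector back into the 3 x 3 patch layout.
filter1 <- matrix(pca$loadings[, 1], nrow = 3, ncol = 3, byrow = TRUE)

# Row and column of the pixel with the largest absolute weight.
strongest <- which(abs(filter1) == max(abs(filter1)), arr.ind = TRUE)
print(strongest)
```

Printing `filter1` alongside the `image()` plot is a simple way to confirm the orientation before naming a pixel position.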

APPENDIX

require(data.table)
require(anytime)
library(plotly)

#=========================
#section 0

matches<- readRDS("./582/df9b1196-e3cf-4cc7-9159-f236fe738215_matches.rds")
odds <- readRDS("./582/df9b1196-e3cf-4cc7-9159-f236fe738215_odd_details.rds")


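# order by date so that the last row of each group is the latest odd
odds=odds[order(matchId,oddtype,bookmaker,date)]

# keep the latest odd per match, bet type, odd type and bookmaker
odds_final=odds[,list(final_odd=odd[.N]),
                by=list(matchId,betType,oddtype,bookmaker)]
# drop Asian handicap odds
odds_final=odds_final[betType!="ah"]
odds_final[,"betType":=NULL]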

wide_odds_final=dcast(odds_final,
                           matchId+bookmaker~oddtype,
                           value.var='final_odd')
wide_odds_final=wide_odds_final[order(bookmaker,matchId)]
wide_odds_final_hold=wide_odds_final
# as.numeric(is.na(wide_odds_final$YES))
wide_odds_final=wide_odds_final[, lapply(.SD, function(x) as.numeric(is.na(x))), by = list(bookmaker, matchId)][order(bookmaker, matchId)][,"matchId":=NULL]
                # YES := as.numeric(is.na(YES))]

bookmaker=odds_final$bookmaker
bookmaker=unique(bookmaker)

compressed_data=wide_odds_final[, lapply(.SD, function(x) sum(x)), by =bookmaker]

# data.table does not keep row names, so the bookmaker labels are stored separately
bookmaker_labels=compressed_data$bookmaker
compressed_data=compressed_data[,bookmaker:=NULL]
heatmap(as.matrix(compressed_data),labRow=bookmaker_labels)


# drop the "bts" columns and plot the heatmap again
compressed_data_no_bts=compressed_data[,c("YES","NO"):=NULL]

heatmap(as.matrix(compressed_data_no_bts),labRow=bookmaker_labels)



# keep the five most complete bookmakers, then drop the "bts" columns
top_bookmakers=c("youwin","Sportingbet","888sport","Betfair Exchange","Betsafe")
wide_odds_final_hold_na1=wide_odds_final_hold[bookmaker %in% top_bookmakers]
wide_odds_final_hold_na1=wide_odds_final_hold_na1[,c("YES","NO"):=NULL]

spilited_data=na.omit(wide_odds_final_hold_na1)
#complete.cases(wide_odds_final_hold)

spilited_data=spilited_data[,"bookmaker" := NULL ][,lapply(.SD, function(x) mean(x)),by=matchId]
# spilited_data=spilited_data[,lapply(.SD[,-1][,-1], function(x) mean(x)),by=matchId]

spilited_data=unique(as.data.frame(spilited_data))
rownames(spilited_data) = spilited_data[["matchId"]]

#end of section 0
#==================

#case 1

pca=prcomp(spilited_data[,-1])

str(pca)
summary(pca)

# cumulative share of the summed standard deviations (a variance share would use pca$sdev^2)
x_axis=cumsum(pca$sdev)/sum(pca$sdev)*100
plot(x_axis,main="Cumulative Sum of PCs",
     xlab="Number of PCs", ylab="Coverage Percentage")



dt = as.data.table(pca$x)
dt = dt[, matchId :=spilited_data[,1]] 

matches_case1=unique(matches)
matches_case1=matches_case1[order(home,-date)]
matches_case1[,c("HomeGoals","AwayGoals"):=tstrsplit(score,':')]
matches_case1$HomeGoals=as.numeric(matches_case1$HomeGoals)
matches_case1[,AwayGoals:=as.numeric(AwayGoals)]
matches_case1[,TotalGoals:=HomeGoals+AwayGoals]
matches_case1[,IsOver:=0]
matches_case1[TotalGoals>2,IsOver:=1]
matches_case1=matches_case1[,c("matchId","IsOver")]

dt_merge = merge(dt, matches_case1, by= "matchId")

plot(dt_merge[["PC1"]], dt_merge[["PC2"]],main="Distribution of Over/Under Match Results in 2D_PCA",
     xlab="PC1", ylab="PC2", col = ifelse(dt_merge[["IsOver"]], "red", "blue"))
legend("top",legend = c("Over","Under"),col=c("red","blue"),cex=1,pch=1)



a=data.frame(dt_merge[,c("PC1","PC2","PC3","IsOver")])
p=plot_ly(a,x = ~PC1, y = ~PC2, z = ~PC3,
          marker = list(color = ~IsOver))
add_markers(p)
  

#=================
#case 2
d= dist(spilited_data[,-1], method = "euclidean")
fit <- cmdscale(d,eig=TRUE, k=2)
dt_2=as.data.table(fit$points)
dt_2 = dt_2[, matchId :=spilited_data[,1]]
dt_merge_2 = merge(dt_2, matches_case1, by= "matchId")
# color by the merged MDS table itself, not by the PCA table from case 1
plot(dt_merge_2[["V1"]], dt_merge_2[["V2"]],main="Distribution of Over/Under Match Results in 2D-MDS(euclidean)",
     xlab="Coordinate 1", ylab="Coordinate 2", col = ifelse(dt_merge_2[["IsOver"]], "red", "blue"))
legend("top",legend = c("Over","Under"),col=c("red","blue"),cex=1,pch=1)

d2= dist(spilited_data[,-1], method = "manhattan")
fit2 <- cmdscale(d2,eig=TRUE, k=2)
dt_3=as.data.table(fit2$points)
dt_3 = dt_3[, matchId :=spilited_data[,1]]
dt_merge_3 = merge(dt_3, matches_case1, by= "matchId")
plot(dt_merge_3[["V1"]], dt_merge_3[["V2"]],main="Distribution of Over/Under Match Results in 2D-MDS(manhattan)",
     xlab="Coordinate 1", ylab="Coordinate 2", col = ifelse(dt_merge_3[["IsOver"]], "red", "blue"))
legend("top",legend = c("Over","Under"),col=c("red","blue"),cex=1,pch=1)

#================
#Q2-home
     matches_case2=unique(matches)
     matches_case2=matches_case2[order(home,-date)]
     matches_case2[,c("HomeGoals","AwayGoals"):=tstrsplit(score,':')]
     matches_case2$HomeGoals=as.numeric(matches_case2$HomeGoals)
     matches_case2[,AwayGoals:=as.numeric(AwayGoals)]
     matches_case2_home=matches_case2[,Result:=0]
     matches_case2_home[HomeGoals>AwayGoals,Result:=1]
     matches_case2_home=matches_case2_home[,c("matchId","Result")]
     
     
     dt_merge_home = merge(dt, matches_case2_home, by= "matchId")
     
     plot(dt_merge_home[["PC1"]], dt_merge_home[["PC2"]],main="Distribution of Home Win Match Results in 2D_PCA",
          xlab="PC1", ylab="PC2", col = ifelse(dt_merge_home[["Result"]], "red", "blue"))
     legend("top",legend = c("Home Win","others"),col=c("red","blue"),cex=1,pch=1)

     a_home=data.frame(dt_merge_home[,c("PC1","PC2","PC3","Result")])
     
     phome=plot_ly(a_home,x = ~PC1, y = ~PC2, z = ~PC3,
               marker = list(color = ~Result))
     add_markers(phome)
 
#Q2-away    

     matches_case2_away=matches_case2[,Result:=0]
     matches_case2_away[HomeGoals<AwayGoals,Result:=1]
     matches_case2_away=matches_case2_away[,c("matchId","Result")]
     
     
     dt_merge_away = merge(dt, matches_case2_away, by= "matchId")
     a_away=data.frame(dt_merge_away[,c("PC1","PC2","PC3","Result")])
     
     paway=plot_ly(a_away,x = ~PC1, y = ~PC2, z = ~PC3,
                   marker = list(color = ~Result))
     add_markers(paway)

#Q2-tie    
     
     matches_case2_tie=matches_case2[,Result:=0]
     matches_case2_tie[HomeGoals==AwayGoals,Result:=1]
     matches_case2_tie=matches_case2_tie[,c("matchId","Result")]
     
     
     dt_merge_tie = merge(dt, matches_case2_tie, by= "matchId")
     a_tie=data.frame(dt_merge_tie[,c("PC1","PC2","PC3","Result")])
     
     ptie=plot_ly(a_tie,x = ~PC1, y = ~PC2, z = ~PC3,
                   marker = list(color = ~Result))
     add_markers(ptie)
 
     
#===================== 
# Q3-1,2
require(jpeg)
require(grDevices)
require(Matrix)

photo=readJPEG("./582/odevv.jpg")
dim(photo)

plot(c(0, 100), c(0, 100), type = "n", xlab = "", ylab = "")
rasterImage(photo[,,3] ,0, 0, 50, 50)
rasterImage(photo ,0, 50, 50, 100)
rasterImage(photo[,,2] ,50, 50, 100, 100)
rasterImage(photo[,,1] ,50, 0, 100, 50)

#===========================
#Q3-3
M1 <- rsparsematrix(512, 512,  nnz = 512*512,  rand.x = runif)
M2 <- rsparsematrix(512, 512,  nnz = 512*512,  rand.x = runif)
M3 <- rsparsematrix(512, 512,  nnz = 512*512,  rand.x = runif)

M11=as(M1,"matrix")
M11=M11/10
M22=as(M2,"matrix")
M22=M22/10
M33=as(M3,"matrix")
M33=M33/10

photo_noise=photo
photo_noise[,,1] = photo[,,1] +M11*1
photo_noise[,,2] = photo[,,2] +M22
photo_noise[,,3] = photo[,,3] +M33

photo_noise[which(photo_noise>1)]=1
photo_noise[which(photo_noise<0)]=0

plot(c(0, 100), c(0, 100), type = "n", xlab = "", ylab = "")
rasterImage(photo_noise[,,3] ,0, 0, 50, 50)
rasterImage(photo_noise ,0, 50, 50, 100)
rasterImage(photo_noise[,,2] ,50, 50, 100, 100)
rasterImage(photo_noise[,,1] ,50, 0, 100, 50)


#==========================   
#Q3-4

greyphoto=(photo_noise[,,1]+photo_noise[,,2]+photo_noise[,,3])/3
plot(c(0, 100), c(0, 100), type = "n", xlab = "", ylab = "")
rasterImage(greyphoto ,0, 0, 100, 100)

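# one row per 3x3 patch of the 512x512 image: 510*510 = 260100 patches
patches <- matrix(0, nrow = 260100, ncol = 9)

n=1
for (p in 1:510) {
  for (r in 1:510) {
    k=0
    # store the 9 pixels of the patch row by row
    for (i in 1:3) {
      for (j in 1:3) {
        k=k+1
        patches[n,k]=greyphoto[p+i-1,r+j-1]
      }
    }
    n=n+1
  }
}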

pca2=princomp(patches)
summary(pca2)
str(pca2)

# rasterImage() needs values in [0,1], so scores outside this range are clipped
pca2$scores[which(pca2$scores>1)]=1
pca2$scores[which(pca2$scores<0)]=0

#PC1
bitti= pca2$scores[,1]
bitti <- matrix(data=bitti, nrow = 510, ncol = 510)
bitti=t(bitti)

plot(c(0, 100), c(0, 100), type = "n", xlab = "", ylab = "")
rasterImage(bitti ,0, 0, 100, 100)



#PC2
bitti2= pca2$scores[,2]
bitti2 <- matrix(data=bitti2, nrow = 510, ncol = 510)
bitti2=t(bitti2)

plot(c(0, 100), c(0, 100), type = "n", xlab = "", ylab = "")
rasterImage(bitti2 ,0, 0, 100, 100)


#PC3
bitti3= pca2$scores[,3]
bitti3 <- matrix(data=bitti3, nrow = 510, ncol = 510)
bitti3=t(bitti3)

plot(c(0, 100), c(0, 100), type = "n", xlab = "", ylab = "")
rasterImage(bitti3 ,0, 0, 100, 100)

#Q3-4-c
dans1=matrix(data=pca2$loadings[,1],nrow=3,ncol=3)
image(dans1)

dans2=matrix(data=pca2$loadings[,2],nrow=3,ncol=3)
image(dans2)


dans3=matrix(data=pca2$loadings[,3],nrow=3,ncol=3)
image(dans3)